Search results for "Information Storage and Retrieval"
showing 10 items of 48 documents
Reactome graph database: Efficient access to complex pathway data
2018
Reactome is a free, open-source, open-data, curated and peer-reviewed knowledgebase of biomolecular pathways. One of its main priorities is to provide easy and efficient access to its high quality curated data. At present, biological pathway databases typically store their contents in relational databases. This limits access efficiency because there are performance issues associated with queries traversing highly interconnected data. The same data in a graph database can be queried more efficiently. Here we present the rationale behind the adoption of a graph database (Neo4j) as well as the new ContentService (REST API) that provides access to these data. The Neo4j graph database and its qu…
FASTdoop: A versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications
2017
Abstract Summary MapReduce Hadoop bioinformatics applications require the availability of special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats like FASTA or BAM. Moreover, the development of these routines is not easy, both because of the diversity of these formats and the need for managing efficiently sequence datasets that may count up to billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files. We show that, with respect to analogous input management routines that have appeared in the Literature, it offers…
Harmonising and linking biomedical and clinical data across disparate data archives to enable integrative cross-biobank research
2015
A wealth of biospecimen samples are stored in modern globally distributed biobanks. Biomedical researchers worldwide need to be able to combine the available resources to improve the power of large-scale studies. A prerequisite for this effort is to be able to search and access phenotypic, clinical and other information about samples that are currently stored at biobanks in an integrated manner. However, privacy issues together with heterogeneous information systems and the lack of agreed-upon vocabularies have made specimen searching across multiple biobanks extremely challenging. We describe three case studies where we have linked samples and sample descriptions in order to facilitate glo…
Active learning strategies for the deduplication of electronic patient data using classification trees.
2012
Graphical abstractDisplay Omitted Highlights? Active learning for medical record linkage is used on a large data set. ? We compare a simple active learning strategy with a more sophisticated variant. ? The active learning method of Sarawagi and Bhamidipaty (2002) 6] is extended. ? We deliver insights into the variations of the results due to random sampling in the active learning strategies. IntroductionSupervised record linkage methods often require a clerical review to gain informative training data. Active learning means to actively prompt the user to label data with special characteristics in order to minimise the review costs. We conducted an empirical evaluation to investigate whether…
BlotBase: a northern blot database.
2008
With the availability of high-throughput gene expression analysis, multiple public expression databases emerged, mostly based on microarray expression data. Although these databases are of significant biomedical value, they do hold significant drawbacks, especially concerning the reliability of single gene expression profiles obtained by microarray data. Simultaneously, reliable data on an individual gene's expression are often published as single northern blots in individual publications. These data were not yet available for high-throughput screening. To reduce the gap between high-throughput expression data and individual highly reliable expression data, we designed a novel database "Blo…
Automatic detection of large dense-core vesicles in secretory cells and statistical analysis of their intracellular distribution.
2010
Analyzing the morphological appearance and the spatial distribution of large dense-core vesicles (granules) in the cell cytoplasm is central to the understanding of regulated exocytosis. This paper is concerned with the automatic detection of granules and the statistical analysis of their spatial locations in different cell groups. We model the locations of granules of a given cell as a realization of a finite spatial point process and the point patterns associated with the cell groups as replicated point patterns of different spatial point processes. First, an algorithm to segment the granules using electron microscopy images is proposed. Second, the relative locations of the granules with…
Upport vector machines for nonlinear kernel ARMA system identification.
2006
Nonlinear system identification based on support vector machines (SVM) has been usually addressed by means of the standard SVM regression (SVR), which can be seen as an implicit nonlinear autoregressive and moving average (ARMA) model in some reproducing kernel Hilbert space (RKHS). The proposal of this letter is twofold. First, the explicit consideration of an ARMA model in an RKHS (SVM-ARMA 2k) is proposed. We show that stating the ARMA equations in an RKHS leads to solving the regularized normal equations in that RKHS, in terms of the autocorrelation and cross correlation of the (nonlinearly) transformed input and output discrete time processes. Second, a general class of SVM-based syste…
Kernel manifold alignment for domain adaptation
2016
The wealth of sensory data coming from different modalities has opened numerous opportu- nities for data analysis. The data are of increasing volume, complexity and dimensionality, thus calling for new methodological innovations towards multimodal data processing. How- ever, multimodal architectures must rely on models able to adapt to changes in the data dis- tribution. Differences in the density functions can be due to changes in acquisition conditions (pose, illumination), sensors characteristics (number of channels, resolution) or different views (e.g. street level vs. aerial views of a same building). We call these different acquisition modes domains, and refer to the adaptation proble…
Characterization of entropy measures against data loss: Application to EEG records
2012
This study is aimed at characterizing three signal entropy measures, Approximate Entropy (ApEn), Sample Entropy (SampEn) and Multiscale Entropy (MSE) over real EEG signals when a number of samples are randomly lost due to, for example, wireless data transmission. The experimental EEG database comprises two main signal groups: control EEGs and epileptic EEGs. Results show that both SampEn and ApEn enable a clear distinction between control and epileptic signals, but SampEn shows a more robust performance over a wide range of sample loss ratios. MSE exhibits a poor behavior for ratios over a 40% of sample loss. The EEG non-stationary and random trends are kept even when a great number of samp…
CoCoDat: a database system for organizing and selecting quantitative data on single neurons and neuronal microcircuitry.
2004
We present a novel database system for organizing and selecting quantitative experimental data on single neurons and neuronal microcircuitry that has proven useful for reference-keeping, experimental planning and computational modelling. Building on our previous experience with large neuroscientific databases, the system takes into account the diversity and method-dependence of single cell and microcircuitry data and provides tools for entering and retrieving published data without a priori interpretation or summarizing. Data representation is based on the framework suggested by biophysical theory and enables flexible combinations of data on membrane conductances, ionic and synaptic current…